
    Dynamically Adaptive Microarchitectures with Optimized Memory Structures and Variable Instruction Sets

    The demands placed on microarchitectures are rising steadily; a large share of the technological innovations of recent decades only became possible through the progress of the semiconductor industry and the associated performance gains of integrated circuits. Further performance gains of integrated circuits can no longer be taken for granted, as physical limits are being reached. New architectures must be designed in order to keep meeting the rising demands from this point on. In this work, flexible microarchitectures were designed and evaluated that optimize various architectural parameters in an application-specific way. The designed microarchitectures meet the stated requirements through a novel and efficient use of the available resources. A content-adaptive memory structure was designed that is laid out for the efficient processing of data analyzed in advance. Thanks to automatic generation, the concept remains flexibly deployable and adaptable. The system also demonstrates the potential that lies in shifting the complexity of a use case from runtime to upfront analysis. A further approach is the concept of transparent and dynamic hardware acceleration in an adaptive processor. To realize it, an automatic mechanism was designed and made available to the processor, with which it can autonomously detect and accelerate compute-intensive kernels at runtime. In this way, the adaptive processor not only combines the generality of a general-purpose processor with the flexibility of a reconfigurable system, but is also able to accelerate applications at runtime independently of the software developer or compiler. This gives the processor an autonomous ability to adapt to the application and thus enables performance gains for kernels that were not considered during the processor's development phase. In summary, it can be shown that powerful and flexible microarchitectures can be designed and realized with comparatively little development effort when the main focus is placed on the efficient use of the available resources.
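
    The abstract does not spell out the detection mechanism, but the idea of autonomously finding compute-intensive kernels at runtime can be illustrated with a minimal sketch: a counter table over backward-branch targets that flags a loop as hot once it crosses a threshold. The table layout, trace format, and threshold are illustrative assumptions, not the thesis's actual design.

```python
# Minimal sketch of counter-based hot-kernel detection. The threshold and
# the (source_pc, target_pc) trace format are illustrative assumptions.
HOT_THRESHOLD = 1024  # executions before a loop counts as a hot kernel

def detect_hot_kernels(branch_trace):
    """branch_trace: iterable of (source_pc, target_pc) taken branches."""
    counters = {}
    hot = set()
    for source_pc, target_pc in branch_trace:
        if target_pc >= source_pc:
            continue  # only backward branches indicate loops
        counters[target_pc] = counters.get(target_pc, 0) + 1
        if counters[target_pc] == HOT_THRESHOLD:
            hot.add(target_pc)  # candidate for hardware acceleration
    return hot

# A tight loop at address 0x400 that runs 2000 times is flagged.
assert detect_hot_kernels([(0x420, 0x400)] * 2000) == {0x400}
```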

    A Hardware Perspective on the ChaCha Ciphers: Scalable ChaCha8/12/20 Implementations Ranging from 476 Slices to Bitrates of 175 Gbit/s

    AES (Advanced Encryption Standard) accelerators are commonly used in high-throughput applications, but they have notable resource requirements. We investigate replacing the AES cipher with ChaCha ciphers and propose the first ChaCha FPGA implementations optimized for data throughput. We then compare implementations of three different system architectures and analyze which aspects dominate their performance. Our experimental results indicate that a bandwidth of 175 Gbit/s can be reached with as little as 2982 slices, whereas comparable state-of-the-art AES accelerators require 10 times as many slices. Taking advantage of the flexibility inherent in the ChaCha cipher, we also demonstrate how our implementation scales to even higher throughputs or lower resource usage (down to 476 slices), benefiting applications which previously could not employ cryptography because of resource limitations.
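
    The arithmetic that dominates such a datapath is the ChaCha quarter-round, which is compact enough to give in full. The sketch below is a plain software model following RFC 8439, not the paper's FPGA design; ChaCha8/12/20 differ only in the round count passed to `chacha_rounds`, which is exactly the flexibility the scalable implementations exploit.

```python
def rotl32(x, n):
    """32-bit left rotation."""
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def quarter_round(s, a, b, c, d):
    """One ChaCha quarter-round on four words of the 16-word state."""
    s[a] = (s[a] + s[b]) & 0xFFFFFFFF; s[d] = rotl32(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & 0xFFFFFFFF; s[b] = rotl32(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & 0xFFFFFFFF; s[d] = rotl32(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & 0xFFFFFFFF; s[b] = rotl32(s[b] ^ s[c], 7)

def chacha_rounds(state, rounds=20):
    """Core round function; rounds = 8, 12 or 20 selects the variant."""
    for _ in range(rounds // 2):
        quarter_round(state, 0, 4, 8, 12)   # column rounds
        quarter_round(state, 1, 5, 9, 13)
        quarter_round(state, 2, 6, 10, 14)
        quarter_round(state, 3, 7, 11, 15)
        quarter_round(state, 0, 5, 10, 15)  # diagonal rounds
        quarter_round(state, 1, 6, 11, 12)
        quarter_round(state, 2, 7, 8, 13)
        quarter_round(state, 3, 4, 9, 14)
    return state
```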

    An Analytical Model of Configurable Systolic Arrays to find the Best-Fitting Accelerator for a given DNN Workload

    Since their breakthrough, the complexity of Deep Neural Networks (DNNs) has risen steadily, and accelerators for DNNs are now used in many domains. However, designing and configuring an accelerator that perfectly meets the requirements of a given application is a challenging task. In this paper, we therefore present our approach to supporting the accelerator design process. With an analytical model of a systolic array, we can estimate performance, energy consumption and area for each design option. To determine these metrics, usually a cycle-accurate simulation is performed, which is time-consuming, so the design space has to be restricted heavily. Analytical modelling, however, allows for fast evaluation of a design using a mathematical abstraction of the accelerator. For DNNs this works especially well, since the dataflow and memory accesses are highly regular. To show the correctness of our model, we perform an exemplary realization with the state-of-the-art systolic array generator Gemmini and compare it with a cycle-accurate simulation and state-of-the-art modelling tools, showing less than 1% deviation. We also conduct a design space exploration, showing the analytical model's capability to support accelerator design. In a case study on ResNet-34, we demonstrate that our model and DSE tool reduce the time to find the best-fitting solution by four or two orders of magnitude compared to a cycle-accurate simulation or state-of-the-art modelling tools, respectively.
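
    To give a flavour of such a model (the paper's actual cost functions are not reproduced here), a first-order latency estimate for an output-stationary rows-by-cols systolic array computing an M-by-K times K-by-N matrix multiplication can be written in closed form. The fill/drain term and the example layer shape below are illustrative assumptions.

```python
import math

def systolic_gemm_cycles(M, K, N, rows, cols):
    """First-order cycle estimate for an output-stationary systolic array.

    Each output tile needs K accumulation cycles plus rows + cols - 2
    cycles of pipeline fill and drain; tiles run back to back. Memory
    stalls are ignored -- a fuller model adds a bandwidth term."""
    tiles = math.ceil(M / rows) * math.ceil(N / cols)
    return tiles * (K + rows + cols - 2)

# Sweep array shapes for one ResNet-like GEMM (illustrative dimensions).
for shape in [(8, 8), (16, 16), (32, 32)]:
    print(shape, systolic_gemm_cycles(3136, 576, 64, *shape))
```

    Closed-form expressions of this kind are what make a design point evaluable in microseconds, which is why the analytical approach can explore design spaces that a cycle-accurate simulation cannot cover.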

    EFFECT: An End-to-End Framework for Evaluating Strategies for Parallel AI Anomaly Detection

    Neural networks achieve high accuracy in tasks like image recognition or segmentation. However, their application in safety-critical domains is limited due to their black-box nature and vulnerability to specific types of attacks. To mitigate this, methods were introduced that detect out-of-distribution inputs or adversarial attacks in parallel to the network inference. These methods are hard to compare because they were developed for different use cases, datasets, and networks. To fill this gap, we introduce EFFECT, an end-to-end framework to evaluate and compare new methods for anomaly detection, without the need for retraining and by using traces of intermediate inference results. The presented workflow works with every preexisting neural network architecture and evaluates the considered anomaly detection methods in terms of accuracy and computational complexity. We demonstrate EFFECT's capabilities by creating new detectors for ShuffleNet and MobileNetV2 for anomaly detection as well as fault origin detection. EFFECT allows us to design an anomaly detector based on the Mahalanobis distance as well as CNN-based detectors. For both use cases, we achieve accuracies of over 85 % in classifying inferences as normal or abnormal, thus beating existing methods.
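
    The Mahalanobis-based detector mentioned above can be sketched compactly: fit a mean and covariance on traces of intermediate activations from normal inferences, then score new inferences by their distance to that distribution. This is a generic textbook formulation, not EFFECT's code; the feature shape and the percentile threshold are assumptions.

```python
import numpy as np

class MahalanobisDetector:
    """Flags inferences whose intermediate features lie far from the
    distribution observed on normal (in-distribution) data."""

    def fit(self, features):
        """features: (n_samples, n_dims) activations from normal runs."""
        self.mean = features.mean(axis=0)
        self.precision = np.linalg.pinv(np.cov(features, rowvar=False))
        # Threshold at the 99th percentile of training-set distances.
        self.threshold = np.percentile(self.score(features), 99)
        return self

    def score(self, features):
        delta = features - self.mean
        return np.sqrt(np.einsum("ij,jk,ik->i", delta, self.precision, delta))

    def is_anomalous(self, features):
        return self.score(features) > self.threshold

# Fit on pooled activations of one hidden layer, then flag outliers.
rng = np.random.default_rng(0)
detector = MahalanobisDetector().fit(rng.normal(size=(1000, 64)))
print(detector.is_anomalous(rng.normal(5.0, 1.0, size=(3, 64))))
```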

    CNNParted: An open source framework for efficient Convolutional Neural Network inference partitioning in embedded systems

    Applications such as autonomous driving or assistive robotics rely heavily on Deep Neural Networks. In particular, Convolutional Neural Networks (CNNs) provide precise and reliable results in image processing tasks like camera-based object detection or semantic segmentation. To achieve even better results, however, CNNs are becoming more and more complex. Deploying these networks in distributed embedded systems thus imposes new challenges, due to additional constraints regarding performance and energy consumption in the near-sensor compute platforms, i.e. the sensor nodes. Processing all data in the central node, on the other hand, is disadvantageous, since raw camera data consumes large bandwidth and running CNN inference for multiple tasks requires considerable compute performance. Moreover, sending raw data over the interconnect is not advisable for privacy reasons. Hence, offloading CNN workload to the sensor nodes can lead to reduced traffic on the link and a higher level of data security. Due to the limited hardware resources of the sensor nodes, however, CNNs have to be partitioned carefully to meet overall latency requirements and energy constraints. We therefore present CNNParted, an open-source framework for efficient, hardware-aware CNN inference partitioning targeting embedded AI applications. It automatically searches for potential partitioning points in the CNN to find a beneficial workload distribution between sensor nodes and a central edge node. CNNParted not only analyzes the CNN architecture but also takes hardware components, such as dedicated hardware accelerators and memories, into consideration to evaluate inference partitioning regarding latency and energy consumption. As an example, we apply CNNParted to three feed-forward CNNs commonly used in embedded systems. The framework first searches for potential partitioning points and then evaluates them regarding inference latency and energy consumption; based on the results, beneficial partitioning points can be identified depending on the system constraints. Using the framework, we are able to find and evaluate 10 potential partitioning points for FCN ResNet-50, 13 for GoogLeNet, and 8 for SqueezeNet V1.1 within 520 s, 330 s, and 140 s, respectively, on an AMD EPYC 7702P running 8 concurrent threads. For GoogLeNet, we determine two partitioning points that provide a good trade-off between required bandwidth, latency and energy consumption. We also provide insights into further findings that can be derived from the evaluation results.
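
    The trade-off the framework searches over can be captured in a few lines: for every candidate cut, total latency is the sensor-side compute up to the cut, plus the transmission of that layer's feature map, plus the edge-side remainder. The sketch below is a simplified stand-in with hypothetical per-layer profiling numbers, not CNNParted's API.

```python
def evaluate_partitions(layers, link_bytes_per_s):
    """layers: per-layer dicts with 'sensor_s' and 'edge_s' compute latency
    and 'out_bytes' feature-map size (hypothetical profiling data).
    Cut i runs layers 0..i on the sensor node and the rest on the edge."""
    results = []
    for cut in range(len(layers)):
        sensor = sum(l["sensor_s"] for l in layers[: cut + 1])
        edge = sum(l["edge_s"] for l in layers[cut + 1:])
        link = layers[cut]["out_bytes"] / link_bytes_per_s
        results.append((cut, sensor + link + edge))
    return min(results, key=lambda r: r[1]), results

layers = [
    {"sensor_s": 0.010, "edge_s": 0.002, "out_bytes": 800_000},
    {"sensor_s": 0.020, "edge_s": 0.004, "out_bytes": 200_000},
    {"sensor_s": 0.015, "edge_s": 0.003, "out_bytes": 50_000},
]
best, _ = evaluate_partitions(layers, link_bytes_per_s=12_500_000)  # 100 Mbit/s
print(best)  # cut index with the lowest end-to-end latency
```

    An analogous sum over per-layer and transmission energy yields the second objective, so each partitioning point becomes one point in a latency/energy trade-off.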

    Message from IEEE SOCC Technical Chairs


    Hardware-aware Partitioning of Convolutional Neural Network Inference for Embedded AI Applications

    Embedded image processing applications like multi-camera-based object detection or semantic segmentation are often based on Convolutional Neural Networks (CNNs) to provide precise and reliable results. The deployment of CNNs in embedded systems, however, imposes additional constraints such as latency restrictions and limited energy consumption in the sensor platform. These requirements have to be considered during hardware/software co-design of embedded Artificial Intelligence (AI) applications. In addition, the transmission of uncompressed image data from the sensor to a central edge node requires large bandwidth on the link, which must also be taken into account during the design phase. Therefore, we present a simulation toolchain for fast evaluation of hardware-aware CNN partitioning for embedded AI applications. This approach explores an efficient workload distribution between sensor nodes and a central edge node: neither processing all layers close to the sensor nor transmitting all uncompressed raw data to the edge node is optimal for every use case. Hence, our proposed simulation toolchain evaluates power and performance metrics for each reasonable partitioning point in a CNN. In contrast to the state of the art, our approach does not only consider the neural network architecture; the simulation toolchain additionally takes into account hardware components, such as dedicated accelerators and memories, that are implemented in the sensor node. As an example, we show the simulation results for three CNNs commonly used in embedded systems and identify advantageous partitioning points regarding inference latency and energy consumption. With the support of the toolchain, we are able to identify three beneficial partitioning points for FCN ResNet-50 and two each for GoogLeNet and SqueezeNet V1.1.
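
    A back-of-the-envelope size comparison shows why neither extreme of the partitioning is optimal in general: early feature maps can be larger than the raw image, while deep ones are far smaller. The tensor shapes below are illustrative, not taken from the paper.

```python
def tensor_bytes(shape, bytes_per_element=1):
    """Size of an activation tensor, assuming 8-bit quantized values."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_element

raw_image = tensor_bytes((3, 1080, 1920))  # uncompressed RGB frame
early_cut = tensor_bytes((64, 540, 960))   # wide early feature map
late_cut = tensor_bytes((512, 68, 120))    # downsampled deep features

# Cutting too early *increases* link traffic; deeper cuts shrink it.
print(raw_image, early_cut, late_cut)      # 6220800, 33177600, 4177920
```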

    Towards the on-device Handwriting Trajectory Reconstruction of the Sensor Enhanced Pen

    Performing handwriting trajectory regression from inertial data with a Deep Neural Network (DNN) on an embedded device is a very challenging task, since the network's accuracy is prone to imperfections in the weights and a significant number of parameters is needed for the regression. In this work, we apply and compare different quantization techniques and the Mitchell logarithmic multiplication approximation in order to enable on-device inference. We show that it is possible to perform the inference of the TCN-based regression model using only 8-bit fixed-point quantization without significant loss of reconstruction precision, and that the accuracy degradation caused by the approximate multiplication can be partially compensated with Quantization-aware Training (QAT). Finally, we demonstrate that the compressed models can be integrated into an off-the-shelf commercial System-on-Chip with minimal use of the FPU, requiring only 460 KB of ROM for the TCN-49 configuration.
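
    Mitchell's logarithmic multiplication replaces the multiplier by additions: log2 of each operand is approximated as the index of its leading one plus a linear fraction, the logs are summed, and the antilog is taken. The sketch below is the textbook integer formulation of Mitchell's 1962 algorithm, not the paper's hardware datapath.

```python
def mitchell_multiply(a, b):
    """Approximate product of unsigned integers via Mitchell's algorithm.

    log2(x) ~= k + f with k the leading-one position and f = x/2**k - 1;
    the approximate logs are added, then converted back to linear."""
    if a == 0 or b == 0:
        return 0
    ka, kb = a.bit_length() - 1, b.bit_length() - 1
    fa = a / (1 << ka) - 1.0  # fractional parts, each in [0, 1)
    fb = b / (1 << kb) - 1.0
    k, f = ka + kb, fa + fb
    if f >= 1.0:              # carry from fraction into characteristic
        k, f = k + 1, f - 1.0
    return int((1 << k) * (1.0 + f))

# Underestimates by at most ~11.1 %; exact for powers of two.
for a, b in [(100, 200), (127, 93), (64, 32)]:
    print(a, b, mitchell_multiply(a, b), a * b)
```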

    KIHT: Kaligo-based Intelligent Handwriting Teacher

    Kaligo-based Intelligent Handwriting Teacher (KIHT) is a bi-nationally funded research project. The aim of this joint project is to develop an intelligent learning device for handwriting, composed of existing components, that can be made available to as many students as possible. With KIHT, we specifically address the challenging task of using inertial sensors to retrace the trajectory of a pen without relying on external reference systems. Because of the nearly unlimited freedom with which a pen glides over paper, state-of-the-art methods have not yet produced a satisfactory solution to this challenge, even with sophisticated algorithms and AI approaches. Together with partners from industry and academia, we take a holistic approach by considering the entire chain of components, from the pen to the embedded processing system, the algorithms, and the app.